Abstract

In this paper we address the problem of identifying text in noisy document images.
We focus in particular on segmenting and distinguishing handwriting from
machine printed text because: 1) handwriting in a document often indicates corrections,
additions, or other supplemental information that should be treated differently from the
main content, and 2) the segmentation and recognition techniques required for machine
printed and handwritten text are significantly different. A novel aspect of our approach
is  that  we  treat  noise  as  a  separate  class  and  model  noise  based  on  selected  features. 
Trained  Fisher  classifiers  are  used  to  identify  machine  printed  text  and  handwriting  from 
noise,  and  we  further  exploit  context  to  refine  the  classification.  A  Markov  Random  Field 
(MRF)  based  approach  is  used  to  model  the  geometrical  structure  of  the  printed  text, 
handwriting,  and  noise  to  rectify  misclassifications.  Experimental  results  show  that  our 
approach  is  robust  and  can  significantly  improve  page  segmentation  in  noisy  document 
collections. 

Keywords:  Text  Identification,  Handwriting  Identification,  Markov  Random  Field, 
Post-Processing,  Noisy  Document  Image  Enhancement,  Document  Analysis 
1 Introduction


Documents are the results of a set of physical processes and conditions, and the resulting
document can be viewed as consisting of layers (letterhead, content, signatures, annotations,
noise, etc. in the case of business correspondence, for example). Document analysis
reverses these processes to segment a document into layers with different physical and semantic
properties. After decades of research, automatic document analysis has advanced
to a point where text segmentation and recognition can be viewed as a solved problem
in clean, well-constrained documents. However, the performance degrades quickly when
a small amount of noise is introduced. For example, a typical bottom-up page segmentation
method starts from the extraction of connected components [1,2]. Based on spatial
proximity and size, connected components are then merged into text lines and zones.
A classification process is then used to identify zone types (text, tables, images, etc.).
These algorithms work well on clean documents where zones with different properties
can be easily separated. However, they often fail on noisy documents where noise mixes
with and/or is spatially close to content regions. For example, Figs. 1(a) and (b) show
segmentation results for an extremely noisy document when we use the Docstrum algorithm
[2] and ScanSoft SDK [3]. Text and noise are erroneously segmented into the same
zones by both algorithms.

In this paper we present a novel approach to identifying text in extremely noisy documents.
Instead of simple noise filtering, as used in other work [1,2], we treat noise
as a distinct class and model it based on selected features. We further distinguish
handwriting from machine printed text because: 1) handwriting in a document often indicates
corrections, additions, or other supplemental information that should be treated
differently from the main content, and 2) segmentation and recognition techniques for
machine printed text and handwriting are significantly different. Based on these considerations,
we treat the problem as a three-class (machine printed text, handwriting, and
noise) identification problem.

In practice, misclassification often happens in an overlapping feature space. This is
especially true for handwriting and noise. To deal with this problem, we exploit contextual
information in post-processing and refine the classification. Contextual information
is very useful for improving classification accuracy. It is widely used in many OCR systems
and its effectiveness has been demonstrated in previous work [4,5]. The key is
to model the statistical dependency among neighboring components. The output of an
OCR system is a text stream which is one-dimensional. Therefore, an N-gram language
model, based on an Nth order 1-D Markov chain, is effective for modeling the context.
With assistance from a dictionary, the N-gram approach can correct most recognition
errors. Images, however, are two-dimensional. Generally, 2-D signals are not causal,
and it is much harder to model the dependency among neighboring components in an
image. Among the image models studied so far, Markov Random Fields (MRF) have
been widely studied and successfully used in many applications. MRFs are suitable for
image analysis because the local statistical dependency of an image can be well modeled
by Markov properties. MRFs can incorporate a priori contextual information or
constraints in a quantitative way. The MRF model has been extensively used in various
image analysis applications such as texture synthesis and segmentation, edge detection,
and image restoration [6,7]. In this paper, we use MRFs to model the dependency of segmented
neighboring blocks. As post-processing, MRFs can further improve classification
accuracy.

Figure 1: Page segmentation results for an extremely noisy document using the Docstrum
algorithm and ScanSoft SDK. Noise is segmented into text zones erroneously in both
cases. (a) Docstrum, (b) ScanSoft.

The  documents  we  are  processing  are  extremely  noisy  with  machine  printed  text, 
handwriting,  and  noise  mixed  together.  We  first  extract  the  connected  components 
and  merge  them  at  the  word  level  based  on  spatial  proximity.  We  then  extract  several 
categories  of  features  and  use  trained  Fisher  classifiers  to  classify  each  word  into  machine 
printed  text,  handwriting,  or  noise.  Finally,  contextual  information  is  incorporated  into 
MRF  models  to  refine  the  classification  results  further. 

The  rest  of  the  paper  is  organized  as  follows:  Section  2  is  a  literature  survey  of  related 
work,  followed  by  a  detailed  description  of  our  classification  method  in  Section  3.  MRF- 
based  post-processing  is  presented  in  Section  4,  and  experimental  results  are  presented 
in  Section  5.  The  paper  concludes  with  a  brief  summary  and  a  discussion  of  future  work. 

2  Related  Work 

The  research  presented  in  this  paper  is  related  to  previous  work  on  page  segmentation, 
zone  classification,  handwriting  identification,  and  document  enhancement. 
2.1 Page Segmentation

Previous  work  on  page  segmentation  can  be  broadly  divided  into  three  categories:  bottom- 
up  [1,2],  top-down  [8],  and  hybrid  [9].  In  a  typical  bottom-up  approach  such  as  the 
Docstrum  algorithm  proposed  by  O’Gorman  [2],  connected  components  are  extracted 
first  and  then  merged  into  words,  lines,  zones,  and  columns  hierarchically  based  on  size 
and spatial proximity. Bottom-up methods can handle documents with complex layouts.
However, they are time-consuming and sensitive to noise.

A typical top-down method, such as the X-Y cuts proposed by Nagy [8], starts from
the whole document and splits it recursively into columns, zones, lines, words and characters.
Top-down methods are effective for documents with regular layouts, but fail
when the documents have a non-Manhattan structure.

Another problem with X-Y cuts is that the global parameters for optimal segmentation
are often difficult to find if prior knowledge is not available. Sylwester et al. proposed
a  hybrid  method  which  starts  from  the  top  [9].  First,  they  over-segment  a  document  into 
small  zones  using  the  X-Y  cut  algorithm.  Then  they  use  the  bottom-up  method  which 
groups  over-segmented  small  zones  with  the  same  properties  into  a  single  zone. 

All  of  the  above  methods  are  based  on  the  analysis  of  foreground  (black  pixels). 
As  an  alternative,  white  stream  methods  based  on  the  analysis  of  background  (white 
pixels)  are  presented  in  [10,11].  In  these  methods,  rectangles  covering  white  gaps  (white 
pixels)  between  foreground  are  extracted.  Foreground  regions  surrounded  by  these  white 
rectangles are extracted as zones. A more comprehensive survey is presented in [12].

2.2  Zone  Classification 

Zone classification labels the content of each segmented zone as one of a set of predefined
types [1,11,13], such as text, images, graphics, and tables. Pavlidis et al. used
correlations  of  horizontal  scan  lines  as  features  to  distinguish  text  and  diagrams  from 
half-tone  images.  The  black  pixel  density  is  used  to  further  distinguish  diagrams  from 
text  [11].  Wang  et  al.  used  69  features,  such  as  run  length  mean  and  variance,  spatial 
mean  and  variance,  fraction  of  the  total  number  of  black  pixels  in  the  zone,  width  ratio 
of  the  zone,  and  number  of  text  glyphs  in  the  zone,  to  classify  each  zone  into  nine  classes. 
They  did  experiments  on  ground-truthed  zones  of  the  UW  III  database,  and  achieved 
an accuracy as high as 98.52% [13]. Jain et al. directly performed classification on the
generalized text lines (GTLs) extracted using a bottom-up approach [1]. If the height of a
GTL  is  less  than  a  threshold  and  the  connected  components  in  it  are  horizontally  aligned, 
it  is  classified  as  a  text  line.  Text  lines  and  non-text  lines  are  merged  into  text  regions 
and  non-text  regions  respectively.  They  further  classify  non-text  regions  into  images, 
tables,  and  drawings.  This  works  well  for  long  text  lines,  but  may  fail  when  the  text  lines 
are  short. 

Some  other  approaches  treat  text,  images,  and  figures  as  different  textures,  and  use 
trained  classifiers  to  segment  and  identify  them  [14-16].  They  often  work  directly  on  gray 
scale  images,  and  need  classification  of  each  pixel.  To  reduce  the  computation  complexity, 
multi-resolution  techniques  are  often  used. 
2.3 Handwriting Identification

Some work has been done on handwriting/machine printed text identification. The classification
is typically performed at the text line [17-20], word [21], or character level [22,23].
At the line level, machine printed text lines are typically arranged regularly with a straight
baseline, while handwritten text lines are irregular with a varying baseline. Srihari et
al. implemented a text line based approach using this characteristic and achieved a
classification accuracy of 95% [20]. One advantage of this approach is that it can be
used in different scripts (Chinese, English, etc.) with little or no modification. Guo
et al. proposed an approach based on the vertical projection profile of the segmented
words [21]. They used a Hidden Markov Model (HMM) as the classifier and achieved a
classification accuracy of 97.2%. Although at the character level less information is available,
humans can still identify the handwritten and machine printed characters easily,
inspiring researchers to pursue classification at the character level. Kuhnke proposed a
neural network-based approach with straightness and symmetry as features [22]. Zheng
et al. used run-length histogram features to identify handwritten and printed Chinese
characters and achieved promising results [23]. In previous work, we implemented a
handwriting identification method based on several categories of features and a trained
Fisher classifier [24]. However, the problems introduced by noise were not addressed.

2.4  Document  Enhancement 

There are two types of degradation in document images: 1) physical degradation of the
hardcopy documents during creation and/or storage, and 2) degradation introduced by
digitization. If severe enough, either of them can reduce the performance of a document
analysis system significantly. Several document degradation models [25-27], methods
for document quality assessment [28,29], and document enhancement algorithms [30-32]
have been presented in previous work. One common enhancement approach is window-based
morphological filtering [30-32]. Morphological filtering uses a look-up table
to determine an output of ON (black pixel) or OFF (white pixel) for each
entry of the table, based on a windowed observation of a pixel's neighbors. These algorithms
can be further categorized as manually designed, semi-manually designed, or automatically
trained approaches. The kFill algorithm, proposed by O'Gorman [32], is a manually
designed approach and has been used by several other researchers [28,33]. Experiments
show it is effective for removing salt-and-pepper noise. Liang et al. proposed a semi-manually
designed approach with a 3 x 3 window size [34]. They manually determine
some entries to output ON or OFF based on a priori observations. The remaining entries
are trained to select the optimal output. It is difficult to manually design a filter with a
large window size, and success depends on experience. If both ideal and degraded images
are available, optimal filters can be designed by training [31]. After registering the ideal
and degraded images at the pixel level, an optimal look-up table, based on observation of
the outputs of each specific windowed context, can be designed. However, it is difficult to
train, store, and retrieve the look-up table when the window size is large. This approach
requires both the original and the corresponding degraded images for training. Loce
used artificially degraded images generated by models for training [31], while Kanungo
et al. proposed methods for validation and parameter estimation of degradation models
[35-37]. Although the uniformity and sensitivity of their approach have been tested by
other researchers [27,38], no degradation model has been declared to pass the validation.
Another problem with morphological approaches is the small window size. The most
commonly used window size is no larger than 5 x 5, which is too small to contain enough
information for enhancement.

Ideally, image quality should be estimated first so the appropriate enhancement algorithms
can be applied automatically. Cannon et al. proposed a document quality
assessment algorithm based on five factors: small speckle, white speckle, touching characters,
broken characters, and font size [28]. They used a linear classifier to select the best
one out of four enhancement algorithms, and reduced the OCR error rate from 20.27%
to 12.60% on their database. Li et al. proposed an approach for quality estimation of
color video text, which classifies the video text quality into six levels [29].

A  majority  of  the  above  approaches  are  focused  on  improving  OCR  accuracy  in  noisy 
documents. As shown in Fig. 1, degradation deteriorates not only OCR performance
but also other document processing tasks, such as page segmentation. Little work
has been done in this area. The difference between our approach and previous work is
that we perform classification to identify noise, and exploit contextual information of
neighboring blocks as a post-processing step to refine the identification. Experiments show
that our noise removal algorithm can increase page segmentation accuracy significantly.

3  Text  Identification 

In  this  section  we  present  our  text  (machine  printed  or  handwritten)  extraction  and 
classification  method. 

3.1  Pattern  Unit 

Special  consideration  must  be  given  to  the  size  of  the  region  being  segmented  before 
we  can  perform  any  classification.  We  call  the  smallest  unit  for  classification  a  pattern 
unit.  If  the  unit  is  too  small,  the  information  contained  in  it  may  not  be  sufficient  for 
classification;  if  it  is  too  large,  however,  different  types  of  components  may  be  mixed 
in  the  same  region.  In  previous  work  we  conducted  a  performance  evaluation  for  the 
classification  accuracy  of  machine  printed  text  and  handwriting  at  the  character,  word, 
and zone levels, and showed that a reliable classification can be achieved at the word level
[24]. We therefore segment images at the word level and then perform classification.
Since noise has no concept of a word, we use the terms block and word interchangeably
in the following presentation.

We  first  extract  connected  components,  and  then  merge  them  into  words  based  on 
geometric proximity and size. Extremely large blocks and blocks with very
large or very small aspect ratios are filtered out. However, noise with size similar to text
cannot  be  filtered  out.  Our  focus  is  to  distinguish  text  from  this  type  of  noise. 
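
The extraction and merging step can be illustrated with a minimal sketch. The text does not specify the connectivity, the merging rule, or any thresholds, so the 8-connectivity, the greedy left-to-right merge, and the parameter `max_gap` below are assumptions made only for illustration.

```python
from collections import deque


def connected_components(img):
    """Return inclusive bounding boxes (x0, y0, x1, y1) of 8-connected black regions.

    `img` is a list of rows; 1 is a black pixel and 0 is a white pixel.
    """
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited black pixel.
            seen[y][x] = True
            queue = deque([(x, y)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < w and 0 <= ny < h
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


def merge_into_words(boxes, max_gap):
    """Greedy merge (an assumed rule): scanning boxes left to right, a box joins
    the most recent word if they overlap vertically and the horizontal gap
    between them is at most `max_gap` pixels."""
    words = []
    for box in sorted(boxes):
        if words:
            wx0, wy0, wx1, wy1 = words[-1]
            x0, y0, x1, y1 = box
            gap = x0 - wx1 - 1
            overlaps_vertically = y0 <= wy1 and wy0 <= y1
            if gap <= max_gap and overlaps_vertically:
                words[-1] = (min(wx0, x0), min(wy0, y0),
                             max(wx1, x1), max(wy1, y1))
                continue
        words.append(box)
    return words
```

A full implementation would also handle multiple text lines and would derive the gap threshold from the dominant character size rather than fixing it by hand.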
Table 1: Features used for machine printed text/handwriting/noise classification

Feature set               Feature description                # of features   # of features selected
Structural                Region size, connected components  18              9
Gabor filter              Stroke orientation                 16              4
Run-length histogram      Stroke length                      20              5
Crossing count histogram  Stroke complexity                  10              6
Bi-level co-occurrence    Texture                            16              2
2x2 gram                  Texture                            60              5
Total                                                        140             31

Figure 2: Illustration of feature extraction. (a) The overlap area of the connected components
inside a pattern unit is extracted as a structural feature. (b) Run-length histogram
features. (c) Crossing count features. The crossing counts of the top and bottom horizontal
scan lines are 1 and 2 respectively. (d) Bi-level 2x2 gram features.
3.2 Feature Extraction


Several  sets  of  features  are  extracted  for  classification.  The  descriptions  and  sizes  of  the 
feature  sets  are  listed  in  Table  1.  Machine  printed  text,  handwriting,  and  noise  have 
different  visual  appearances  and  physical  structures.  Structural  features  are  extracted 
to  reflect  these  differences.  Gabor  filter  features  and  run-length  histogram  features  can 
capture  the  difference  in  stroke  orientation  and  stroke  length  between  handwriting  and 
printed text. Compared with text, noise blocks often have low stroke complexity.
Therefore, crossing count histogram features are exploited to model such differences. We
further treat regions of machine printed text, handwriting, and noise blocks as different
textures. Two sets of bi-level texture features (bi-level co-occurrence features and bi-level
2x2  gram  features)  are  used  for  classification.  In  the  following  subsections  we  present 
these  features  in  detail. 

3.2.1  Structural  Features 

We  extract  two  sets  of  structural  features.  The  first  set  includes  features  related  to  the 
physical  sizes  of  the  blocks  such  as  density  of  black  pixels,  width,  height,  aspect  ratio, 
and area. Suppose the image of the block is $I(x, y)$, $0 \le x < w$, $0 \le y < h$, where $w$ and $h$ are
its width and height respectively. Each pixel in the block has two values: 0 representing
background (a white pixel) and 1 representing content (a black pixel). Then the density
of the black pixels $d$ is

$$d = \frac{\sum_{x=0}^{w-1} \sum_{y=0}^{h-1} I(x, y)}{w h} \qquad (1)$$

The  sizes  of  machine  printed  words  are  more  consistent  than  those  of  handwriting  and 
noise  on  the  same  page.  However,  machine  printed  words  on  different  pages  may  vary 
significantly.  Therefore,  we  use  a  histogram  technique  to  estimate  the  dominant  font 
size  [2],  and  then  use  the  dominant  font  size  to  normalize  the  width  (w),  height  (h), 
aspect  ratio  (r),  and  area  (a)  of  the  block. 

The second set of structural features is based on the connected components inside
the block, such as the mean and variance of the width ($m_w$ and $\sigma_w$), height ($m_h$ and $\sigma_h$),
aspect ratio ($m_r$ and $\sigma_r$), and area ($m_a$ and $\sigma_a$) of connected components. The sizes
of connected components inside a machine printed word are more consistent, leading
to smaller $\sigma_w$ and $\sigma_h$. For a handwritten word or noise block, the bounding boxes of
the  connected  components  tend  to  overlap  with  each  other,  as  shown  in  Fig.  2(a).  For 
machine  printed  English  words,  however,  each  character  forms  a  connected  component 
not  overlapping  with  others.  The  overlapping  area  (the  sum  of  the  areas  of  the  gray 
rectangles  in  Fig.  2(a))  normalized  by  the  total  area  of  the  block  is  calculated  as  a 
feature.  Another  feature  we  use  is  the  variance  of  the  vertical  projection.  In  a  machine 
printed  text  block,  the  vertical  projection  profile  has  obvious  valleys  and  peaks  since 
neighboring  characters  do  not  touch  each  other.  However,  for  a  handwritten  word  or 
noise  block,  the  vertical  projections  are  much  smoother,  resulting  in  smaller  variance. 
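
Three of the structural features described above can be sketched as follows. This is an illustration under stated assumptions, not the exact implementation: the overlap is taken as the sum of pairwise bounding-box intersections, and the projection variance is the population variance of the column sums. Boxes use inclusive coordinates (x0, y0, x1, y1).

```python
def black_pixel_density(block):
    """Eq. (1): number of black pixels divided by the block area w * h."""
    h, w = len(block), len(block[0])
    return sum(map(sum, block)) / (w * h)


def overlap_area_ratio(boxes, block_w, block_h):
    """Sum of pairwise bounding-box intersection areas, normalized by the
    block area (assumed reading of the gray rectangles in Fig. 2(a))."""
    total = 0
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            ax0, ay0, ax1, ay1 = boxes[i]
            bx0, by0, bx1, by1 = boxes[j]
            overlap_w = min(ax1, bx1) - max(ax0, bx0) + 1
            overlap_h = min(ay1, by1) - max(ay0, by0) + 1
            if overlap_w > 0 and overlap_h > 0:
                total += overlap_w * overlap_h
    return total / (block_w * block_h)


def vertical_projection_variance(block):
    """Population variance of the per-column black pixel counts."""
    h, w = len(block), len(block[0])
    profile = [sum(block[y][x] for y in range(h)) for x in range(w)]
    mean = sum(profile) / w
    return sum((p - mean) ** 2 for p in profile) / w
```

On a block with separated vertical strokes the projection alternates between peaks and valleys and the variance is large, while a block whose columns all contain the same amount of ink gives a variance of zero.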
3.2.2 Gabor Filter Features


Gabor  filters  can  represent  signals  in  both  the  frequency  and  time  domains  with  minimum 
uncertainty  [39]  and  have  been  widely  used  for  texture  analysis  and  segmentation  [15]. 
Researchers have found that they match the mammalian visual system very well, which provides
further support for using them in our classification task. In the spatial and
frequency domains, the two-dimensional Gabor filter is defined as


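$$g(x, y) = \exp\left\{ -\pi \left[ \frac{x'^2}{\sigma_x^2} + \frac{y'^2}{\sigma_y^2} \right] \right\} \times \cos\{ 2\pi (u_0 x + v_0 y) \} \qquad (2)$$

$$G(u, v) = 2\pi \sigma_x \sigma_y \left( \exp\{ -\pi [ (u' - u'_0)^2 \sigma_x^2 + (v' - v'_0)^2 \sigma_y^2 ] \} + \exp\{ -\pi [ (u' + u'_0)^2 \sigma_x^2 + (v' + v'_0)^2 \sigma_y^2 ] \} \right) \qquad (3)$$

where $x' = -x \sin\theta + y \cos\theta$, $y' = -x \cos\theta - y \sin\theta$, $u' = u \sin\theta - v \cos\theta$,
$v' = -u \cos\theta - v \sin\theta$, $u'_0 = -u_0 \sin\theta + v_0 \cos\theta$, $v'_0 = -u_0 \cos\theta - v_0 \sin\theta$,
$u_0 = f \cos\theta$, and $v_0 = f \sin\theta$.
Here $f$ and $\theta$ are two parameters, representing the central frequency and orientation of
the Gabor filter.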

The variances of the filtered images are taken as features. In our experiments 16
Gabor filters with different orientations $\theta_k = k \times 180^\circ / N$, $k = 1, 2, \ldots, 16$ (so $N = 16$), are used, which
generate 16 features.
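
A minimal sketch of these features is given below. It samples the spatial-domain filter of Eq. (2) on a small grid, filters the block, and returns one variance per orientation. The central frequency `f`, the spreads `sigma_x` and `sigma_y`, the kernel half-size `half`, and the zero padding at the block border are not specified in the text and are hypothetical choices made only for illustration.

```python
import math


def gabor_kernel(f, theta, sigma_x, sigma_y, half):
    """Sample g(x, y) of Eq. (2) on a (2*half + 1) x (2*half + 1) grid centred at 0."""
    u0, v0 = f * math.cos(theta), f * math.sin(theta)
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotated coordinates as defined after Eq. (3).
            xr = -x * math.sin(theta) + y * math.cos(theta)
            yr = -x * math.cos(theta) - y * math.sin(theta)
            envelope = math.exp(-math.pi * (xr ** 2 / sigma_x ** 2
                                            + yr ** 2 / sigma_y ** 2))
            row.append(envelope * math.cos(2 * math.pi * (u0 * x + v0 * y)))
        kernel.append(row)
    return kernel


def filter_variance(img, kernel):
    """Variance of the filtered image, with zero padding outside the block.

    The kernel satisfies g(-x, -y) = g(x, y), so correlation and convolution
    give the same result here.
    """
    h, w = len(img), len(img[0])
    half = len(kernel) // 2
    out = []
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky in range(-half, half + 1):
                for kx in range(-half, half + 1):
                    yy, xx = y + ky, x + kx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += img[yy][xx] * kernel[ky + half][kx + half]
            out.append(acc)
    mean = sum(out) / len(out)
    return sum((v - mean) ** 2 for v in out) / len(out)


def gabor_features(img, f=0.25, sigma_x=2.0, sigma_y=2.0, half=4, n=16):
    """One variance per orientation theta_k = k * 180 / n degrees, k = 1..n."""
    return [
        filter_variance(
            img, gabor_kernel(f, math.radians(k * 180.0 / n), sigma_x, sigma_y, half))
        for k in range(1, n + 1)
    ]
```

The direct double loop keeps the sketch dependency-free; for full word images an FFT-based convolution would be the usual choice.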


3.2.3  Run-length  Histogram  Features 

Run-length histogram features were proposed in [23] for machine printed/handwritten
Chinese character classification. These features are used in our case to capture the
difference between the stroke lengths of machine printed text, handwriting, and noise
blocks. First, black pixel run-lengths in four directions, including horizontal, vertical,
major diagonal, and minor diagonal, are extracted. We then calculate four histograms
of run-lengths for these four directions, as shown in Fig. 2(b). To get scale-invariant
features, we normalize the histograms. Suppose $C_k$, $k = 1, 2, \ldots, N$, is the number of runs
with length $k$, and $N$ is the maximal length of all possible runs; then the normalized
histogram $C'_k$ is

$$C'_k = \frac{C_k}{\sum_{i=1}^{N} C_i} \qquad (4)$$

We then divide the histogram into five bins of equal width and use five Gaussian-shaped
weight windows to get the final features (Fig. 2(b)). Taking the horizontal run-length
histogram as an example, the run-length histogram feature $R_{hi}$ is calculated as

$$R_{hi} = \sum_{k=1}^{w} G(k; u_i, \sigma)\, C'_k, \quad i = 1, 2, 3, 4, 5 \qquad (5)$$

where $w$ is the width of the block (the maximal length of all possible horizontal run-lengths)
and $G(k; u_i, \sigma)$ is a Gaussian-shaped function:

$$G(k; u_i, \sigma) = \exp\left\{ -\frac{(k - u_i)^2}{2\sigma^2} \right\} \qquad (6)$$

As shown in Figure 2(b), $\sigma$ is chosen so the weight on each bin border is 0.5. Another
alternative is to use rectangular windows without overlap between neighboring bins.
Experiments show that the extracted features with Gaussian weighted windows are more
robust. Five features are extracted in each direction, leading to 20 features.
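
The horizontal case of Eqs. (4) to (6) can be sketched as follows. The exact bin placement is not given in the text; the sketch assumes that bin $i$ covers $((i-1)b, ib]$ with $b = w/5$ and that $u_i$ is the bin centre, so that requiring a weight of 0.5 at a distance $b/2$ from the centre gives $\sigma = (b/2)/\sqrt{2 \ln 2}$.

```python
import math


def horizontal_run_lengths(block):
    """Lengths of all maximal horizontal runs of black pixels (value 1)."""
    runs = []
    for row in block:
        length = 0
        for pixel in row:
            if pixel == 1:
                length += 1
            elif length:
                runs.append(length)
                length = 0
        if length:
            runs.append(length)
    return runs


def border_sigma(bin_width):
    """Sigma for which the Gaussian weight equals 0.5 at a bin border."""
    return (bin_width / 2) / math.sqrt(2 * math.log(2))


def gaussian_weight(k, center, sigma):
    """Eq. (6): Gaussian-shaped window centred at `center`."""
    return math.exp(-((k - center) ** 2) / (2 * sigma ** 2))


def run_length_features(block, n_bins=5):
    """Eq. (5) for the horizontal direction; returns n_bins features."""
    w = len(block[0])
    counts = [0] * (w + 1)          # counts[k] = number of runs of length k
    for run in horizontal_run_lengths(block):
        counts[run] += 1
    total = sum(counts)
    if total == 0:                  # no black pixels at all
        return [0.0] * n_bins
    normalized = [c / total for c in counts]            # Eq. (4)
    bin_width = w / n_bins
    sigma = border_sigma(bin_width)
    features = []
    for i in range(1, n_bins + 1):
        center = (i - 0.5) * bin_width                  # assumed bin centre u_i
        features.append(sum(gaussian_weight(k, center, sigma) * normalized[k]
                            for k in range(1, w + 1)))
    return features
```

The vertical and diagonal directions follow the same pattern, with the maximal possible run length replacing the block width $w$.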

3.2.4  Crossing  Count  Histogram  Features 

A  crossing  count  is  the  number  of  times  the  pixel  value  changes  from  0  (white  pixel)  to 
1  (black  pixel)  along  a  horizontal  or  vertical  raster  scan  line.  As  shown  in  Figure  2(c), 
the  crossing  counts  of  the  top  and  bottom  horizontal  scan  lines  are  1  and  2  respectively. 
Crossing  counts  can  be  used  to  measure  stroke  complexity  [24,40].  In  our  approach,  first 
the  crossing  count  for  each  horizontal  and  vertical  scan  line  is  calculated.  Similarly  we 
get  two  histograms  for  the  horizontal  and  vertical  crossing  counts  respectively.  The  same 
technique  (as  in  extracting  the  run-length  histogram  features)  is  exploited  to  get  the  final 
features  from  the  histograms.  A  total  of  10  features  are  extracted. 
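
Computing the crossing counts and their histograms is straightforward, as the sketch below shows. It assumes that pixels outside the block are white, so a scan line that begins with a black pixel contributes one crossing; the text does not state how this boundary case is handled.

```python
def crossing_count(line):
    """Number of 0 -> 1 transitions along a scan line.

    Pixels outside the block are treated as white (an assumption), so a
    leading black pixel counts as one crossing.
    """
    count, previous = 0, 0
    for pixel in line:
        if previous == 0 and pixel == 1:
            count += 1
        previous = pixel
    return count


def crossing_count_histograms(block):
    """Histograms of crossing counts over horizontal and vertical scan lines.

    hist[c] is the number of scan lines with exactly c crossings.
    """
    h, w = len(block), len(block[0])
    row_counts = [crossing_count(row) for row in block]
    col_counts = [crossing_count([block[y][x] for y in range(h)])
                  for x in range(w)]

    def histogram(values):
        hist = [0] * (max(values) + 1)
        for v in values:
            hist[v] += 1
        return hist

    return histogram(row_counts), histogram(col_counts)
```

The two histograms would then be normalized and weighted with five Gaussian windows each, in the same way as the run-length histograms, to give the 10 features.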

3.2.5  Bi-level  Co-occurrence  Features 

A co-occurrence count is the number of times a given pair of pixels occurs at a fixed distance
and orientation [41]. In the case of binary images, the possible co-occurrence pairs
are white-white, black-white, white-black and black-black. In our case, we are concerned
primarily  with  the  foreground.  Since  the  white  background  region  often  accounts  for 
up  to  80%  of  a  document  page,  the  occurrence  frequency  of  white-white  or  white-black 
pixel  pairs  will  always  be  much  higher  than  that  of  black-black  pairs.  The  black-black 
pairs  carry  most  of  the  information.  To  eliminate  the  redundancy  and  reduce  the  effects 
of  over-emphasizing  the  background,  we  consider  only  black-black  pairs.  Four  different 
orientations  (horizontal,  vertical,  major  diagonal  and  minor  diagonal)  and  four  distance 
levels  (1,  2,  4,  and  8  pixels)  are  used  for  classification  (16  features  total).  The  horizontal 
co-occurrence count $C_h(d)$, for example, is defined as

$$C_h(d) = \sum_{x} \sum_{y} I(x, y)\, I(x + d, y), \quad d = 1, 2, 4, 8 \qquad (7)$$

$I(x, y) = 0$ for white pixels; therefore only black-black pixel pairs contribute. For a fixed
distance $d$ we normalize the occurrence by dividing by the sum of the occurrences in all
four directions.
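
Eq. (7) and the per-distance normalization can be sketched as follows. Which diagonal is called major and which minor is an assumption here, as is returning zeros when no black-black pair exists at a given distance.

```python
# Unit offsets (dx, dy) for the four orientations; the diagonal naming is assumed.
DIRECTIONS = {
    "horizontal": (1, 0),
    "vertical": (0, 1),
    "major_diagonal": (1, 1),
    "minor_diagonal": (1, -1),
}


def black_black_count(block, dx, dy):
    """Eq. (7) generalized to an arbitrary offset: sum of I(x, y) * I(x+dx, y+dy)."""
    h, w = len(block), len(block[0])
    count = 0
    for y in range(h):
        for x in range(w):
            x2, y2 = x + dx, y + dy
            if 0 <= x2 < w and 0 <= y2 < h:
                count += block[y][x] * block[y2][x2]
    return count


def cooccurrence_features(block, distances=(1, 2, 4, 8)):
    """16 features: for each distance, the four directional black-black counts
    divided by their sum over the four directions."""
    features = []
    for d in distances:
        counts = [black_black_count(block, sx * d, sy * d)
                  for sx, sy in DIRECTIONS.values()]
        total = sum(counts)
        features.extend(c / total if total else 0.0 for c in counts)
    return features
```

For a block containing a single horizontal stroke, all of the weight at each distance falls on the horizontal direction, which is the behaviour the normalization is meant to expose.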

3.2.6 Bi-level 2x2 gram Features

The N x M grams were first introduced in the context of image classification and retrieval
[42]. An N x M gram extends the one-dimensional co-occurrence feature to the two-dimensional
case. We only consider 2x2 grams, which count the numbers of occurrences
of the patterns shown in Figure 2(d). The cells labeled 0/1 should take specific values,
and the values of other cells are irrelevant. Therefore there are $2^4 = 16$ patterns for each
distance $d$. As with the co-occurrence features, the all-white pattern is removed to reduce
over-emphasis on the background. For a fixed distance, the occurrences are normalized
by dividing by the sum of all occurrences. Four distances (1, 2, 4, and 8 pixels) are
chosen, generating 4 x 15 = 60 features.
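
A minimal sketch of the 2x2 gram features follows. It assumes that the four labeled cells of Figure 2(d) are the corners of a square of side $d$, that is, the pixels at $(x, y)$, $(x+d, y)$, $(x, y+d)$, and $(x+d, y+d)$, and it encodes each pattern as a 4-bit integer; both choices are illustrative assumptions.

```python
def two_by_two_gram_features(block, distances=(1, 2, 4, 8)):
    """60 features: for each distance, the normalized counts of the 15
    non-white corner patterns."""
    h, w = len(block), len(block[0])
    features = []
    for d in distances:
        counts = [0] * 16
        for y in range(h - d):
            for x in range(w - d):
                # Encode the four corner pixels as one 4-bit pattern index.
                pattern = ((block[y][x] << 3) | (block[y][x + d] << 2)
                           | (block[y + d][x] << 1) | block[y + d][x + d])
                counts[pattern] += 1
        kept = counts[1:]               # drop pattern 0, the all-white pattern
        total = sum(kept)
        features.extend(c / total if total else 0.0 for c in kept)
    return features
```

When the block is smaller than the distance $d$, no window fits and the 15 features for that distance are all zero.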

3.3  Feature  Selection 

There are two purposes for feature selection. The first is to reduce the computation needed
for feature extraction and classification. As shown in Table 1 we extract a total of 140
features from the segmented blocks. Though these features are designed to distinguish
between different types of blocks, some features may contain more information than others.
Using only a small set of the most powerful features reduces the time for feature
extraction and classification. The second purpose is to alleviate the curse of dimensionality.
When the number of training samples is limited, using a large feature set may
decrease the generality of a classifier [43]. The larger the feature set, the more training
samples are needed. Therefore, we perform feature selection before feeding the features
to the classifier.

We  use  a  forward  search  algorithm  to  perform  feature  selection  [44],  We  first  divide 
the  whole  feature  set  T  into  a  currently  selected  feature  set  Ts  and  an  un-selected  feature 
set  Tn  which  satisfy 


Ts  U  Tn  =  T 

The  selection  procedure  can  then  be  described  as 

1.  Set  JFS  =  $,  and  Tn  =  T . 

2.  Label  all  features  in  as  un-tested. 

3.  Select  one  un-tested  feature  /  G  Tn  and  label  it  as  tested. 

4.  Put  /  and  Ts  together,  and  generate  a  temporary  selected  feature  set  jFgf. 

5.  Estimate  the  classification  accuracy  with  feature  set  Tfs  using  a  1-NN  classifier  and 
leave-one-out  cross  validation  technique,  The  basic  idea  is  that  at  each  iteration 
only  one  sample  is  used  for  testing,  while  the  others  have  been  used  for  training. 
We  repeat  this  process  until  all  samples  have  been  used  as  testing  samples  once. 
The  average  accuracy  for  all  iterations  is  taken  as  the  estimated  accuracy  for  the 
current  feature  set.  The  leave-one-out  cross  validation  technique  can  estimate  the 
accuracy  of  a  classifier  with  small  variation  [43]. 

6. If there are un-tested features in $\mathcal{F}_n$, go to step 3.

7. Find the feature $f^* \in \mathcal{F}_n$ whose temporary feature set $\mathcal{F}_s^f$ has
the highest classification accuracy:

$$f^* = \arg\max_{f \in \mathcal{F}_n} \mathrm{Accuracy}(\mathcal{F}_s^f) \qquad (10)$$

then move $f^*$ from $\mathcal{F}_n$ to $\mathcal{F}_s$.
Figure 3: Feature analysis. (a) Feature selection: the best classification result is achieved
when 31 features are selected. (b) PCA: the best classification result is achieved when
64 principal components are used.


8. If $\mathcal{F}_s \subset \mathcal{F}$ (un-selected features remain), go to step 2; otherwise exit.
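The eight steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the LNKnet implementation used in the experiments: the function names, the squared Euclidean distance, and the tie-breaking rule (lowest feature index) are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def loo_1nn_accuracy(samples: Sequence[Sequence[float]],
                     labels: Sequence[int],
                     feature_idx: Sequence[int]) -> float:
    """Leave-one-out accuracy of a 1-NN classifier restricted to feature_idx."""
    correct = 0
    for i, x in enumerate(samples):
        best_dist, best_label = float("inf"), None
        for j, z in enumerate(samples):
            if i == j:  # the held-out sample is never its own neighbor
                continue
            d = sum((x[k] - z[k]) ** 2 for k in feature_idx)
            if d < best_dist:
                best_dist, best_label = d, labels[j]
        correct += best_label == labels[i]
    return correct / len(samples)


def forward_selection(samples, labels) -> List[Tuple[List[int], float]]:
    """Greedy forward search; returns (selected set, accuracy) after each round."""
    selected: List[int] = []                    # F_s, initially empty
    remaining = list(range(len(samples[0])))    # F_n, initially all features
    history = []
    while remaining:                            # step 8: loop until F_n is empty
        # steps 2-6: score every temporary set F_s + {f}
        scored = [(loo_1nn_accuracy(samples, labels, selected + [f]), f)
                  for f in remaining]
        # step 7: best candidate (ties broken by lowest feature index)
        best_acc, best_f = max(scored, key=lambda t: (t[0], -t[1]))
        selected.append(best_f)
        remaining.remove(best_f)
        history.append((list(selected), best_acc))
    return history


# Toy data (invented): feature 0 separates the classes, feature 1 misleads.
toy_samples = [[0.0, 5.0], [0.1, 1.0], [1.0, 4.9], [1.1, 1.1]]
toy_labels = [0, 0, 1, 1]
toy_history = forward_selection(toy_samples, toy_labels)
```

On the toy data the accuracy is highest after the first round and drops once the misleading feature is added, which mirrors the trend reversal discussed for Fig. 3(a). The final feature set would be the entry of the history with the highest accuracy.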

We use the LNKnet pattern classification software to conduct our feature selection
experiments [45]. LNKnet provides several classifiers, such as likelihood classifiers, k-NN
classifiers, and neural network classifiers, and several feature selection algorithms, such as
forward search, backward search, and forward and backward search. Feature selection
can be an extremely expensive task. Considering the large number of feature sets to
evaluate and the number of classifiers to train, we use the lightweight forward feature
selection algorithm and the 1-NN classifier, which does not need training, in our feature
selection experiment.

We  collected  about  1,500  blocks  for  each  class.  As  shown  in  Fig.  3(a),  when  the 
number  of  selected  features  increases  the  error  rate  decreases  sharply  at  first.  The  trend 
reverses  at  some  point.  The  best  classification  is  achieved  when  only  31  features  are 
selected,  with  an  error  rate  of  5.7%.  When  all  features  are  used,  the  error  rate  increases 
to  9.2%  due  to  the  limited  number  of  training  samples  and  large  feature  set.  The  last 
column in Table 1 lists the number of features selected in each set. It shows that texture
features, such as bi-level co-occurrence and 2x2 grams, are less discriminating than other
feature sets, mainly due to the small region size. Only 1/8 of the bi-level co-occurrence
features and 1/12 of the 2x2 gram features are selected. Crossing count histogram
features  and  structural  features  are  very  effective,  with  more  than  half  of  the  original 
features  in  both  sets  selected  in  the  final  feature  set. 

Principal Component Analysis (PCA) is another technique for reducing feature
dimension [43]. To extract the first $n$ principal components, we need to search for a subspace
of dimension $n$ with orthonormal basis $w = \{w_1, \ldots, w_n\}$. Suppose the mean has already
been removed from the feature vector $X$, and let the projection of $X$ onto this subspace
be $\hat{X}$:

$$\hat{X} = (w_1^T X)\,w_1 + (w_2^T X)\,w_2 + \cdots + (w_n^T X)\,w_n \qquad (11)$$

PCA finds the optimal subspace $w$ such that the energy contained in $\hat{X}$ is maximized:

$$w = \arg\max_{w_i} \sum_{i=1}^{n} \mathrm{Var}[w_i^T X] \quad \text{s.t.} \quad w_i^T w_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases} \qquad (12)$$

The optimal basis is given by the first $n$ eigenvectors of the covariance matrix of $X$,
corresponding to the $n$ largest eigenvalues [43]. The first $n$ principal components are
$p_i = w_i^T X$, $i = 1, \ldots, n$. The idea of PCA is to concentrate the energy into the first
several principal components. Assuming the classification information is contained in the
energy, the first several principal components are more powerful than the remaining
components. Furthermore, PCA can remove the correlation among features. As in the feature
selection experiment, the 1-NN classifier and the leave-one-out technique are used to
estimate the classification accuracy. Figure 3(b) shows the classification error rate versus
the number of principal components used. As in feature selection, the error rate
drops quickly at first, until 16 principal components have been added. The minimal error
rate, 8.5%, is achieved when 64 principal components are used. Compared with the minimum
error rate of 5.7% achieved by the feature selection technique, PCA is not as powerful
as feature selection in this problem. Furthermore, to perform PCA, all features must
be extracted first, whereas for feature selection we only need to extract the selected
features, which speeds up feature extraction. Therefore, in the following,
we perform classification on the 31 selected features.
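As a small illustration of the eigenvector property used above, the following pure-Python sketch estimates the covariance matrix of mean-removed data and finds its dominant eigenvector by power iteration. Power iteration is a substitute chosen here to keep the example dependency-free; it is not the procedure used in the experiments, and the function names and the starting vector are assumptions. The sketch does not handle a zero covariance matrix or a start vector orthogonal to the dominant eigenvector.

```python
import math
from typing import List, Sequence


def covariance(data: Sequence[Sequence[float]]) -> List[List[float]]:
    """Sample covariance matrix after removing the mean of each feature."""
    n, d = len(data), len(data[0])
    mean = [sum(row[k] for row in data) / n for k in range(d)]
    centered = [[row[k] - mean[k] for k in range(d)] for row in data]
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def first_principal_direction(cov: List[List[float]],
                              iters: int = 200) -> List[float]:
    """Power iteration: approaches the eigenvector of the largest eigenvalue."""
    d = len(cov)
    # Slightly asymmetric start to reduce the chance of an orthogonal start.
    w = [1.0 + 0.1 * k for k in range(d)]
    for _ in range(iters):
        v = [sum(cov[a][b] * w[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]
    return w


# Toy data (invented): all points lie on the line y = x, so the first
# principal direction should be the unit vector along (1, 1).
line_data = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
w1 = first_principal_direction(covariance(line_data))
```

The first principal component of a mean-removed sample is then the inner product of `w1` with that sample, as in $p_1 = w_1^T X$.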


3.4  Classification 

Compared  with  the  Neural  Network  (NN)  and  the  Support  Vector  Machine  (SVM),  the 
Fisher  classifier  is  easier  to  train,  faster  for  classification,  needs  fewer  training  samples, 
and  does  not  suffer  from  over-training  problems.  According  to  the  comparison  experiment 
in  Subsection  5.2,  the  SVM  classifier  performs  slightly  better  than  the  Fisher  classifier, 
but  the  latter  is  much  faster;  we  therefore  use  it  for  classification. 

For a feature vector $X$, the Fisher classifier projects $X$ onto one dimension $Y$ in
direction $W$:

$$Y = W^T X \qquad (13)$$

The Fisher criterion finds the optimal projection direction $W_0$ by maximizing the ratio
of the between-class scatter to the within-class scatter, which benefits the classification.
Let $S_w$ and $S_b$ be the within- and between-class scatter matrices respectively,

$$S_w = \sum_{k=1}^{K} \sum_{X \in \text{class } k} (X - u_k)(X - u_k)^T \qquad (14)$$

$$S_b = \sum_{k=1}^{K} (u_k - u_0)(u_k - u_0)^T \qquad (15)$$

II 

"i? 

(16) 

where $u_k$ is the mean vector of the $k$th class, $u_0$ is the global mean vector, and $K$ is
the number of classes. The optimal projection direction is the eigenvector of $S_w^{-1} S_b$
corresponding to its largest eigenvalue [43]. For a two-class classification problem, we do
not need to calculate the eigenvectors of $S_w^{-1} S_b$. It can be shown that the optimal projection
direction is

$$W_0 = S_w^{-1}(u_1 - u_2) \qquad (17)$$

Let $Y_1$ and $Y_2$ be the projections of the two classes and let $E[Y_1]$ and $E[Y_2]$ be the means of
$Y_1$ and $Y_2$ respectively. Suppose $E[Y_1] > E[Y_2]$; then the decision can be made as

$$C(X) = \begin{cases} \text{class 1} & \text{if } Y > (E[Y_1] + E[Y_2])/2 \\ \text{class 2} & \text{otherwise} \end{cases} \qquad (18)$$


It is known that if the feature vector $X$ is jointly Gaussian distributed, and the two classes
have the same covariance matrix, then the Fisher classifier is optimal in a minimum
classification error sense [43].
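The two-class case can be illustrated with a short sketch. To stay dependency-free, it is restricted to two-dimensional feature vectors so that the within-class scatter matrix can be inverted in closed form; this restriction, the function names, and the toy data are assumptions for the example, whereas the experiments in this paper use 31 features.

```python
from typing import Sequence, Tuple

Vec = Sequence[float]


def mean2(xs: Sequence[Vec]) -> Tuple[float, float]:
    n = len(xs)
    return (sum(x[0] for x in xs) / n, sum(x[1] for x in xs) / n)


def fisher_train(class1: Sequence[Vec], class2: Sequence[Vec]):
    """Return (W0, threshold) for two classes of 2-D feature vectors."""
    u1, u2 = mean2(class1), mean2(class2)
    # Within-class scatter S_w = [[a, b], [b, d]], summed over both classes.
    a = b = d = 0.0
    for xs, u in ((class1, u1), (class2, u2)):
        for x in xs:
            dx, dy = x[0] - u[0], x[1] - u[1]
            a += dx * dx
            b += dx * dy
            d += dy * dy
    det = a * d - b * b  # assumed non-zero in this sketch
    m0, m1 = u1[0] - u2[0], u1[1] - u2[1]
    # W0 = S_w^{-1} (u1 - u2), using the closed-form inverse of a 2x2 matrix.
    w = ((d * m0 - b * m1) / det, (-b * m0 + a * m1) / det)
    # Threshold: midpoint between the projected class means.
    e1 = w[0] * u1[0] + w[1] * u1[1]
    e2 = w[0] * u2[0] + w[1] * u2[1]
    return w, (e1 + e2) / 2.0


def fisher_classify(w, threshold: float, x: Vec) -> int:
    y = w[0] * x[0] + w[1] * x[1]  # projection Y = W^T X
    return 1 if y > threshold else 2


# Toy data (invented): two well separated square clusters.
cls1 = [(2.0, 2.0), (3.0, 2.0), (2.0, 3.0), (3.0, 3.0)]
cls2 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
w0, thr = fisher_train(cls1, cls2)
```

With this choice of $W_0$ the projected mean of class 1 is always the larger one, because $(u_1 - u_2)^T S_w^{-1} (u_1 - u_2) > 0$ for a positive definite $S_w$, so the supposition $E[Y_1] > E[Y_2]$ holds by construction. Classification itself needs only one inner product per sample.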

The  Fisher  classifier  is  often  used  for  two-class  classification  problems.  Although  it 
can  be  extended  to  multi-class  classification  (three  classes  in  our  case),  the  classification 
accuracy  decreases  due  to  the  overlap  between  neighboring  classes.  Therefore,  we  use 
three  Fisher  classifiers,  each  optimized  for  a  two-class  classification  problem  (machine 
printed  text/handwriting,  machine  printed  text/noise,  and  handwriting/noise).  Each 
classifier  outputs  a  confidence  in  the  classification  and  the  final  decision  is  made  by 
fusing  the  outputs  of  all  three  classifiers. 


3.5  Classification  Confidence 


In  a  Fisher  classifier,  the  feature  vector  is  projected  onto  an  axis  on  which  the  ratio 
of  between-class  scatter  to  within-class  scatter  is  maximized.  According  to  the  central 
limit  theorem  [46],  the  distribution  of  the  projection  can  be  approximated  by  a  Gaussian 
distribution,  if  no  feature  has  dominant  variance  over  the  others,  as  follows: 


$$f_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{y - m}{\sigma}\right)^2\right] \qquad (19)$$


where $f_Y(y)$ is the probability density function of the projection. The parameters $m$ and
$\sigma$ can be estimated from training samples. The classification confidence $C_{i,j}$ of class $i$
using classifier $j$ is defined as

$$C_{i,j} = \begin{cases} \dfrac{f_Y(y \mid X \in \text{class } i)}{f_Y(y \mid X \in \text{class } i) + f_Y(y \mid X \in \text{another class})} & \text{if } i \text{ is applicable for classifier } j \\ 0 & \text{otherwise} \end{cases} \qquad (20)$$

where $i$ is the class label and $j$ indexes the trained classifiers. If a classifier is trained
on classes 1 and 2, its output is not applicable to estimating the classification confidence
of class 3; therefore, $C_{3,j} = 0$ for that classifier. The final classification confidence is defined as

$$C_i = \frac{1}{2} \sum_{j=1}^{3} C_{i,j} \qquad (21)$$

Since $C_{i,j} \in [0, 1]$ for the two applicable classifiers and $C_{i,j} = 0$ for the third classifier,
$C_i \in [0, 1]$. However, $C_i$ is not a good estimate of the a posteriori probability since
$\sum_{i=1}^{3} C_i = 1.5$ instead of 1. We can take $C_i$ as an estimate of a non-decreasing function
of the a posteriori probability, which is a kind of generalized classification confidence [47].
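The fusion of the three pairwise classifiers can be sketched as follows. The class names, the dictionary layout, and the numeric confidences are hypothetical values chosen for illustration; the sketch also does not guard against both densities underflowing to zero.

```python
import math
from typing import Dict, Tuple

CLASSES = ("printed", "handwriting", "noise")


def gaussian_pdf(y: float, m: float, sigma: float) -> float:
    """Gaussian density of a projection value, as in (19)."""
    z = (y - m) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)


def pairwise_confidence(y: float,
                        params_a: Tuple[float, float],
                        params_b: Tuple[float, float]) -> Tuple[float, float]:
    """Confidences of the two classes one classifier was trained on, as in (20).

    params_a and params_b are (mean, sigma) of each class's projection.
    """
    fa = gaussian_pdf(y, *params_a)
    fb = gaussian_pdf(y, *params_b)
    return fa / (fa + fb), fb / (fa + fb)


def fuse(pairwise: Dict[Tuple[str, str], Tuple[float, float]]) -> Dict[str, float]:
    """Final confidence: half the sum over classifiers; non-applicable terms are 0."""
    total = {c: 0.0 for c in CLASSES}
    for (a, b), (ca, cb) in pairwise.items():
        total[a] += ca
        total[b] += cb
    return {c: 0.5 * v for c, v in total.items()}


# Hypothetical outputs of the three pairwise classifiers for one block.
example = {
    ("printed", "handwriting"): (0.9, 0.1),
    ("printed", "noise"): (0.8, 0.2),
    ("handwriting", "noise"): (0.5, 0.5),
}
fused = fuse(example)
```

Each pairwise classifier contributes confidences that sum to 1, so the three fused confidences always sum to 3/2, which is why they cannot be read directly as a posteriori probabilities.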

Fig. 4 shows the word segmentation and classification results (with the Fisher classifier)
for the whole and parts of a document image, with blue, red, and green rectangles
representing machine printed text, handwriting, and noise respectively. We can see that
most of the blocks are correctly classified. However, some blocks are misclassified due
to overlap in the feature space. For example, some noise blocks are classified as handwriting
in Fig. 4(b), and some small printed words are classified as noise in Fig. 4(c).
Since very little information is available in such small areas, it is very hard to get good
results. In the next section, we present a Markov Random Field (MRF) based
post-processing method that refines the classification by incorporating contextual information.

4  MRF-Based  Post-Processing 
4.1  Background 

Let $X$ denote the random field defined on $\Omega$ and let $\Gamma$ denote the set of all possible
configurations of $X$ on $\Omega$. $X$ is an MRF with respect to the neighborhood $\eta$ if it has the
following Markov property

$$P(X = x) > 0 \quad \text{for all } x \in \Gamma \qquad (22)$$

$$P(x_s \mid x_r,\ r \in \Omega,\ r \neq s) = P(x_s \mid x_r,\ r \in \eta) \qquad (23)$$

Compared with Markov chains, one difficulty with MRFs is that there is no chain
rule for MRFs. The joint probability $P(X = x)$ cannot be recursively written in terms of
the local conditional probabilities $P(x_s \mid x_r, r \in \eta)$. Therefore it is difficult to get an optimal
estimate $\hat{X}$ of the MRF which maximizes the a posteriori probability

$$\hat{X} = \arg\max_{X} P(X \mid Y) \qquad (24)$$

The establishment of the connection between the MRF and the Gibbs distribution provides
a way to optimize the MRF. To maximize the a posteriori probability of the MRF, we
need to minimize the total energy of the corresponding Gibbs distribution

$$\hat{X} = \arg\min_{X} \sum_{c \in C} V_c(X) \qquad (25)$$

Here, a clique $c$ is defined as a subset of sites in which every pair of distinct sites are
neighbors. The clique potential $V_c(X)$ is the energy associated with a clique, and depends
on the local configuration on clique $c$. Therefore, the optimization problem (24) is
converted to another optimization problem (25). The information about the observation
$Y$ is contained in the clique system.

In  the  study  of  MRFs,  the  problems  are  often  posed  as  labeling  problems  in  which  a 
set  of  labels  are  assigned  to  sites  of  an  MRF  [7].  In  our  problem,  each  block  constitutes 
a  site  of  an  MRF.  A  label  (as  one  of  machine-printed  text,  handwriting,  and  noise)  is 
assigned  to  each  block,  and  context  information  (encoded  by  the  MRF  model)  is  used 
to  flip  the  labels  so  that  the  total  energy  of  the  corresponding  Gibbs  distribution  is 
minimized.  Relaxation  algorithms  are  often  used  for  MRF  optimization  [7]. 
Figure 4: Word block segmentation and classification results. Blue, red, and green
represent machine printed text, handwriting, and noise, respectively. (a) A whole document
image, (b) and (c) two parts of the image in (a).
Figure 5: Clique definition. (a) Cp for horizontally arranged machine printed words, (b)
Cn for noise blocks.


4.2  Clique  Definition 

As shown in (25), the MRF is totally determined by clique $c$ and clique potential $V_c(X)$.
The  design  of  the  clique  and  its  potential  is  crucial,  but  a  systematic  method  is  not  yet 
available.  In  our  case,  machine  printed  text,  handwriting,  and  noise  exhibit  different 
patterns  of  geometric  relationships.  Our  definition  of  cliques  reflects  these  differences. 

Printed words often form horizontal (or vertical) text lines. Clique Cp, defined in
Fig. 5(a), models contextual constraints on neighboring machine printed words.
We first define the connection between word blocks $i$ and $j$. As shown in Fig. 5(a), $O_v$ is
the vertical overlap between two blocks, and $D_h$ is the horizontal distance between two
blocks. The distance between blocks $i$ and $j$ is

$$D(i,j) = |D_h(i,j) - G_w| + |H_i - H_j| + |C_{h_i} - C_{h_j}| \qquad (26)$$

where $D_h(i,j)$ is the horizontal distance between words $i$ and $j$, $G_w$ is the estimated
average word gap in the whole document, $H_i$ and $H_j$ are the heights of blocks $i$ and $j$
respectively, and $C_{h_i}$ and $C_{h_j}$ are the vertical centers of the two blocks. Two blocks are
connected if they satisfy

1. $O_v > \min(H_i, H_j)/2$

2. $0 < D_h < 2 G_w$

3. $D(i,j) < T_p$, where $T_p$ is a threshold; the post-processing results are not sensitive to its value.

After defining the connection between two blocks we can construct a graph in which
nodes represent blocks and edges link two connected nodes. The property of an edge can
be measured by the distance $D(i,j)$ between two blocks. If a node is connected with more
than one node on one side (left or right), we only keep the edge with the smallest distance.
Clique Cp can be represented by nodes together with their left and right neighbors. If we
cannot find a neighbor on the left and/or right side, the corresponding neighbor is set to
NULL.
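The connection test for clique Cp can be sketched as below. Fig. 5(a) is not reproduced here, so the exact geometric definitions are assumptions: blocks are taken to be axis-aligned boxes in image coordinates (top smaller than bottom), the horizontal distance is the gap between the facing edges, and the vertical overlap is the length of the shared vertical extent. The word gap and threshold values in the example are invented.

```python
from dataclasses import dataclass


@dataclass
class Block:
    left: float
    right: float
    top: float      # image coordinates: top < bottom
    bottom: float

    @property
    def height(self) -> float:
        return self.bottom - self.top

    @property
    def v_center(self) -> float:
        return (self.top + self.bottom) / 2.0


def horizontal_gap(a: Block, b: Block) -> float:
    """Assumed D_h: gap between the facing vertical edges (negative if overlapping)."""
    return max(a.left, b.left) - min(a.right, b.right)


def vertical_overlap(a: Block, b: Block) -> float:
    """Assumed O_v: length of the shared vertical extent (negative if none)."""
    return min(a.bottom, b.bottom) - max(a.top, b.top)


def cp_distance(a: Block, b: Block, word_gap: float) -> float:
    """Distance of (26): gap deviation + height difference + center offset."""
    return (abs(horizontal_gap(a, b) - word_gap)
            + abs(a.height - b.height)
            + abs(a.v_center - b.v_center))


def cp_connected(a: Block, b: Block, word_gap: float, t_p: float) -> bool:
    """The three connection conditions listed above."""
    dh = horizontal_gap(a, b)
    return (vertical_overlap(a, b) > min(a.height, b.height) / 2.0
            and 0 < dh < 2 * word_gap
            and cp_distance(a, b, word_gap) < t_p)


# Two words on the same text line, and one block far below them.
word_a = Block(left=0, right=50, top=100, bottom=120)
word_b = Block(left=60, right=110, top=102, bottom=121)
far_block = Block(left=60, right=110, top=200, bottom=220)
```

Keeping only the nearest connected block on each side of a node then yields the left and right neighbors of its clique.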

Noise blocks exhibit rather random patterns in geometric relationships and tend to
overlap or be very close to each other. As shown in Fig. 5(b), the noise block labeled
“Center” overlaps blocks 1, 2, and 3, and is very close to block 4. Clique Cn is
defined primarily for noise blocks. Similarly, the distance between two blocks is defined
as

$$D(i,j) = \max(D_h(i,j), D_v(i,j)) \qquad (27)$$

where $D_h(i,j) = \max(L_i, L_j) - \min(R_i, R_j)$, $D_v(i,j) = \max(T_i, T_j) - \min(B_i, B_j)$, and $L$,
$R$, $T$, $B$ are the left, right, top, and bottom coordinates of the corresponding blocks. If
two blocks overlap in the horizontal or vertical direction, then $D_h(i,j) < 0$ or $D_v(i,j) < 0$.
Blocks $i$ and $j$ are connected if and only if $D(i,j) < T_n$, where $T_n$ is a threshold. If $T_n$
is too big, incorrect label flips of noise and handwriting between two printed text lines
may happen. If $T_n$ is too small, the contextual constraints on the noise blocks cannot be
fully used. We set $T_n$ to half of the dominant character height (about 10 pixels in our
experiments). Each node, together with all nodes connected to it, defines clique Cn. The
number of connected nodes may vary from 0 to about 10, depending on the size of the
block. As an approximation, we consider only the four nearest connected neighbors.
If the number of neighbors is less than four, we set the corresponding neighbors to NULL.
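A minimal sketch of the distance in (27) and of the construction of clique Cn follows. The box layout `(left, right, top, bottom)`, the use of `None` for NULL neighbors, and the example coordinates are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (left, right, top, bottom)


def cn_distance(a: Box, b: Box) -> float:
    """Distance of (27); a negative component means overlap along that axis."""
    dh = max(a[0], b[0]) - min(a[1], b[1])
    dv = max(a[2], b[2]) - min(a[3], b[3])
    return max(dh, dv)


def cn_clique(center: int, boxes: Sequence[Box], t_n: float,
              size: int = 4) -> List[Optional[int]]:
    """Indices of up to `size` nearest connected neighbors, padded with None."""
    near = sorted((cn_distance(boxes[center], b), j)
                  for j, b in enumerate(boxes) if j != center)
    picked: List[Optional[int]] = [j for d, j in near if d < t_n][:size]
    return picked + [None] * (size - len(picked))


# Invented layout: box 1 overlaps box 0, box 2 is 2 pixels to its right,
# and box 3 is far away.
boxes = [(0, 10, 0, 10), (5, 15, 5, 15), (12, 20, 0, 10), (100, 110, 100, 110)]
neighbors = cn_clique(0, boxes, t_n=10)
```

Because the distance is the larger of the two axis gaps, two boxes are close only if they are close (or overlapping) both horizontally and vertically.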

The geometric constraint on handwriting has weaker horizontal or vertical structure
than that on machine printed words, and is thus partially reflected in both cliques Cp and Cn.
Therefore we do not define a specific clique for handwriting.

4.3  Clique  Potential 

Clique  potential  is  the  energy  associated  with  a  clique.  Generally,  we  assign  high  energy 
to  an  undesirable  configuration  of  the  clique  and  low  energy  to  a  preferred  configuration. 
For  example,  an  undesired  configuration  of  clique  Cp  (as  shown  in  Figure  5  (a))  is  that  the 
left  and  right  blocks  are  labeled  as  printed  text  and  the  center  block  as  noise.  Flipping 
the  label  of  the  center  block  from  noise  to  printed  text  would  achieve  a  more  preferred 
configuration,  and  reduce  the  total  energy.  Another  undesirable  configuration  is  that 
all  blocks  are  labeled  as  printed  text  for  the  clique  Cn  in  Figure  5  (b).  It  should  have 
higher  energy  than  the  configuration  in  which  all  blocks  are  labeled  as  noise.  In  many 
applications  the  clique  potentials  are  defined  in  ad  hoc  ways.  One  systematic  way  is 
to  define  clique  potential  as  the  occurrence  frequency  of  each  clique  in  the  training  set, 
which  can  be  expressed  as  a  function  of  local  conditional  probabilities.  Based  on  this 
idea, we define two clique potentials $V_p(c)$ and $V_n(c)$ for cliques Cp and Cn as

$$V_p(c) = -\frac{P(X_l, X_c, X_r)}{(P(X_l) P(X_c) P(X_r))^w} \qquad (28)$$

$$V_n(c) = -\frac{P(X_c, X_1, X_2, X_3, X_4)}{(P(X_c) P(X_1) P(X_2) P(X_3) P(X_4))^w} \qquad (29)$$

where $X_l$, $X_c$, and $X_r$ are labels for the left, center, and right blocks of clique $c$, $w$ is a
constant, and $X_i$, $i = 1, 2, 3, 4$, is the label of the $i$th nearest block. The negative sign
assigns low energy to frequently observed configurations. The energy of the
corresponding Gibbs distribution is

$$U(X \mid Y) = w_s \sum_{s \in \Omega} \{-P(x_s \mid y_s)\} + w_p \sum_{c \in C_p} V_p(c) + w_n \sum_{c \in C_n} V_n(c) \qquad (30)$$

where ws, wp, and wn are weights which adjust the relative importance of classification
confidence and contextual information for cliques Cp and Cn. If ws = 1, wp = 0, and
wn = 0, no contextual information is used; as wp and wn increase, more emphasis is put
on contextual information. If we set wp = wn = ∞, or equivalently ws = 0, no
classification confidence is used.

In the following experiments, we use MRFs for word block labeling. The
number of handwritten words is much smaller than that of the other two types, leading
to a lower estimated frequency of cliques with handwriting. As a result, the optimization
tends to label handwritten words as machine printed text or noise. Therefore, we regularize
the estimated clique frequencies $P(X_l, X_c, X_r)$ and $P(X_c, X_1, X_2, X_3, X_4)$ by dividing
by the product of the probabilities of the word block labels which compose the clique.
This regularization is very similar to the previous approach [48], where $w$ is set to 1.
In our case, $w$ is adjustable; increasing $w$ emphasizes handwritten words. Our clique
potential definition is systematic, and can be optimized for different applications.
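The effect of the exponent w on rare labels can be seen in a small sketch that estimates a regularized clique potential from counts. The counts are invented, and the sign convention (more frequent configurations get lower energy) is an assumption chosen to be consistent with the energy minimization in (25).

```python
from collections import Counter
from typing import Tuple


def clique_potential(clique: Tuple[str, ...],
                     clique_counts: Counter,
                     label_counts: Counter,
                     w: float) -> float:
    """Minus the clique frequency divided by the w-th power of the product
    of the single-label probabilities of the blocks in the clique."""
    p_clique = clique_counts[clique] / sum(clique_counts.values())
    n_labels = sum(label_counts.values())
    prior = 1.0
    for lab in clique:
        prior *= label_counts[lab] / n_labels
    return -p_clique / (prior ** w)


# Invented training statistics: printed text (P) is four times as frequent
# as handwriting (H), both at the block level and at the clique level.
clique_counts = Counter({("P", "P", "P"): 80, ("H", "H", "H"): 20})
label_counts = Counter({"P": 80, "H": 20})

v_print_w0 = clique_potential(("P", "P", "P"), clique_counts, label_counts, 0.0)
v_hand_w0 = clique_potential(("H", "H", "H"), clique_counts, label_counts, 0.0)
v_print_w1 = clique_potential(("P", "P", "P"), clique_counts, label_counts, 1.0)
v_hand_w1 = clique_potential(("H", "H", "H"), clique_counts, label_counts, 1.0)
```

With w = 0 the all-handwriting clique has the higher energy simply because it is rarer. With w = 1 the division by the small label probabilities reverses the order, so handwriting configurations are favored more strongly as w grows.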

After defining the cliques and the corresponding clique potentials, we can search for the
optimal configuration of the labels of all blocks, so that the total energy of the corresponding
Gibbs distribution is minimized. Relaxation algorithms are often used for MRF
optimization. There are two types of relaxation algorithms: stochastic and deterministic [7].
Stochastic algorithms can always converge to the global optimal solution if some
constraints are satisfied. They are, however, computationally demanding. Deterministic
algorithms are simpler, but only converge to local optimal solutions that depend on the
initial value. In our experiments, Highest Confidence First (HCF), a deterministic approach,
is used for MRF optimization due to its fast speed and good performance [49].
The HCF algorithm finds the block whose label flip would reduce the total energy the
most, and then flips its label to the desired one. It repeats this
procedure until no single flip can further reduce the total energy. Since each flip
reduces the energy and the energy is bounded below, the HCF algorithm converges
in a finite number of steps. Fig. 6 is an example of the refined classification results after
post-processing. Compared with Fig. 4, we can see in Figs. 6(a) and (b) that most
misclassified noise blocks are corrected, with a few exceptions due to their having fewer
constraints. The misclassified small machine printed words are all corrected in Fig. 6(c).

5  Experiments 
5.1  Data  Set 

We  collected  a  total  of  318  business  letters  from  the  tobacco  industry  litigation  archives. 
These  document  images  are  noisy  with  a  lot  of  handwritten  annotations  and  signatures, 
few  logos,  and  no  figures  or  tables.  Currently,  we  identify  three  classes:  machine  printed 
text,  handwriting,  and  noise.  Since  the  groundtruthing  of  each  word  block  in  the  images 
of  the  entire  database  would  be  time  consuming,  we  only  did  it  for  94  extremely  noisy 
document  images.  These  94  images  are  used  for  testing,  and  the  other  224  images  for 
training.  All  handwritten  words  (about  1,500)  in  the  training  set  are  groundtruthed. 
Since  there  is  much  more  machine  printed  text  and  noise,  we  randomly  selected  and 
groundtruthed about the same number of samples of each type in the training set. We
use accuracy and precision as metrics to evaluate the result:

$$\text{Accuracy of type } i = \frac{\#\text{ of correctly classified blocks of type } i}{\#\text{ of blocks of type } i} \qquad (31)$$

$$\text{Precision of type } i = \frac{\#\text{ of correctly classified blocks of type } i}{\#\text{ of blocks classified as type } i} \qquad (32)$$

Figure 6: Word block classification results after post-processing. The result before
post-processing is shown in Fig. 4. (a) The whole document image, (b) and (c) two parts of
the image in (a).

Table 2: Performance comparison of three different classifiers: the k-NN classifier, the
Fisher classifier, and the SVM classifier. In the table, Acc means accuracy, and Var
means variance.

               # of      k-NN classifier         Fisher classifier       SVM classifier
               blocks    Correct  Acc    Var     Correct  Acc    Var     Correct  Acc    Var
Printed text   1,519     1,489    98.0%  1.4%    1,473    97.0%  1.1%    1,480    97.4%  1.2%
Handwriting    1,518     1,390    91.6%  2.3%    1,410    92.9%  2.2%    1,435    94.5%  2.1%
Noise          1,512     1,406    93.0%  2.0%    1,451    96.0%  1.5%    1,453    96.1%  1.2%
Overall        4,549     4,285    94.2%  1.3%    4,344    95.5%  0.9%    4,368    96.0%  0.9%

Table 3: Single word block classification

               # of      Percentage  # of correctly     # of misclassified  Accuracy  Precision
               blocks                classified blocks  blocks
Printed text   19,227    66.9%       18,446             781                 95.9%     99.5%
Handwriting    701       2.4%        653                48                  93.2%     62.9%
Noise          8,802     30.7%       8,522              280                 96.8%     93.0%
Overall        28,730    100.0%      27,621             1,109               96.1%     N/A
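The accuracy and precision metrics can be computed from a confusion matrix as sketched below. The counts in the example are invented for illustration and are not taken from the experiments; they only show how a rare class can have high accuracy but low precision.

```python
from typing import Dict, Tuple


def accuracy_and_precision(confusion: Dict[Tuple[str, str], int],
                           cls: str) -> Tuple[float, float]:
    """confusion[(true_type, predicted_type)] holds a number of blocks."""
    correct = confusion.get((cls, cls), 0)
    n_true = sum(n for (t, _), n in confusion.items() if t == cls)
    n_pred = sum(n for (_, p), n in confusion.items() if p == cls)
    return correct / n_true, correct / n_pred


# Invented two-class example: 1,000 printed blocks (P), 50 handwritten (H).
toy_confusion = {
    ("P", "P"): 950, ("P", "H"): 50,
    ("H", "H"): 45, ("H", "P"): 5,
}
acc_h, prec_h = accuracy_and_precision(toy_confusion, "H")
acc_p, prec_p = accuracy_and_precision(toy_confusion, "P")
```

Here 90% of the handwritten blocks are recognized, yet fewer than half of the blocks labeled as handwriting are correct, because 5% of the much larger printed class is misclassified as handwriting.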


5.2  Classifier  Comparison 

In  this  section,  we  compare  the  performance  of  three  different  classifiers:  the  k-NN 
classifier,  the  Fisher  classifier,  and  the  SVM  classifier.  The  SVM  classifier  is  based  on 
VC  dimension  theory  and  structural  risk  minimization  theory  of  statistical  learning  [50] . 
A public domain SVM tool, LibSVM, is used in the following experiment [51]. The N-fold
cross validation technique, a variation of the leave-one-out technique, is used to estimate the
classification accuracy. Instead of holding out one sample for testing at each iteration, it first
divides the data set into N groups (N = 10 in our experiment), and then holds out one group
of samples for testing and uses the remaining groups for training. The classification accuracies
of all the classifiers are shown in Table 2. We can see that the SVM classifier achieved
the highest accuracy. Considering the large variance, however, the improvement is not significant.
The variance of the classification accuracy of all classifiers is the smallest for printed
text and the largest for handwriting, indicating that the printed text is more compact
in the feature space. Among all three classifiers, the Fisher classifier is the fastest since
only one vector multiplication is needed to perform a classification. Therefore, we use
the Fisher classifier for the rest of the experiments.

The classification result on the test set of 94 images, using the Fisher classifier, is
shown in Table 3. The accuracies on all three classes range from 93.2% to 96.8%, with
an overall accuracy of 96.1%. While this overall accuracy is very high, we notice that the
precision for handwriting is very low (62.9%). This is mainly because of the small number
of handwritten words in the testing set. Even small percentages of misclassification of
machine printed text and noise as handwriting will significantly decrease the precision of
handwriting.

Figure 7: MRF-based post-processing. (a) Number of corrected blocks using clique Cp.
(b) Number of corrected blocks using clique Cn. (c) Number of corrected blocks using
clique Cp and classification confidence.

5.3  Post-processing  Using  MRFs 

In the following experiments we investigate how MRFs can improve classification
accuracy. In the first run, we set ws = 0, wn = 0, and wp = 1 to show the effectiveness of
clique Cp. Fig. 7(a) shows the number of corrected blocks, which were previously
misclassified, as w changes. As expected, Cp is very effective for machine printed words,
but not so effective for handwriting and noise. When w = 0.3 (under this condition, the
classification accuracy of all three classes increases), 355 (46%) of the previously
misclassified machine printed words are corrected. When w increases, handwriting is more
emphasized, leading to higher classification accuracy of handwriting, and lower accuracy
of machine printed words and noise. In practice, w can be adjusted to optimize the
overall accuracy.

In  the  second  run,  we  test  the  effectiveness  of  clique  Cn  by  setting  ws  =  0,  wp  =  0, 
and  wn  =  1.  As  shown  in  Fig.  7(b),  clique  Cn  is  very  effective  in  correcting  classification 
errors  of  noise  blocks.  The  classification  error  of  noise  blocks  is  greatly  reduced  when 
w  is  small.  For  w  =  0.6  (under  this  condition,  the  classification  accuracy  of  all  classes 
increases),  the  number  of  misclassified  noise  blocks  is  reduced  by  99  (35%).  Cn  can  also 
correct  some  classification  errors  of  machine  printed  words,  but  is  less  effective  than  Cp 
as  shown  in  Fig.  7(a). 

The third run tests the effectiveness of classification confidence for post-processing.
Fig. 7(c) shows post-processing results obtained by adjusting wp when w = 0.3, wn = 0, and
ws = 1. Adjusting wp changes the total flip number greatly. When wp = 0, the
energy reaches the minimum with the initial labels, and the total flip number is 0. When
wp increases, more emphasis is put on the contextual information, and the flip number
increases. When wp → +∞, it converges to the case of wp = 1 and ws = 0, the setting
of the first run. The maximal overall classification accuracy is achieved when wp = 6.
Table 4: Word block classification after MRF based post-processing

               # of      # of correctly     # of misclassified  Reduction of          Error reduction  Accuracy  Precision
               blocks    classified blocks  blocks              misclassified blocks  rate
Printed text   19,227    18,835             392                 389                   49.8%            98.0%     99.7%
Handwriting    701       652                49                  -1                    -2.1%            93.0%     83.3%
Noise          8,802     8,682              120                 160                   57.1%            98.6%     96.0%
Total          28,730    28,169             561                 548                   49.4%            98.1%     N/A

Compared  with  the  first  run,  the  total  number  of  corrected  blocks  increases  from  389  to 
424  by  incorporating  classification  confidence.  Similar  results  are  achieved  by  combining 
classification  confidence  with  clique  Cn. 

In  the  last  run,  we  fix  ws  =  1  and  manually  adjust  w,  wp,  and  wn  to  optimize  the 
overall  classification  accuracy.  The  final  parameters  we  chose  are  w  =  0.39,  wp  =  5,  and 
wn  =  4.  Table  4  shows  the  results  after  post-processing.  The  “Error  Reduction  Rate”  in 
Table  4  is  defined  as  follows: 

$$\text{Error Reduction Rate} = \frac{\#\text{ of errors before post-processing} - \#\text{ of errors after post-processing}}{\#\text{ of errors before post-processing}} \qquad (33)$$
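As a quick arithmetic check, the error reduction rate can be recomputed from the numbers of misclassified blocks reported in Table 3 (before post-processing) and Table 4 (after post-processing). The function and variable names are chosen for this example only.

```python
def error_reduction_rate(errors_before: int, errors_after: int) -> float:
    """Relative reduction in the number of errors, as in (33)."""
    return (errors_before - errors_after) / errors_before


# Misclassified blocks from Table 3 (before) and Table 4 (after).
errors_before = {"printed": 781, "handwriting": 48, "noise": 280, "total": 1109}
errors_after = {"printed": 392, "handwriting": 49, "noise": 120, "total": 561}

rates_percent = {
    k: round(100 * error_reduction_rate(errors_before[k], errors_after[k]), 1)
    for k in errors_before
}
```

The resulting percentages agree with the "Error reduction rate" column of Table 4, including the slightly negative value for handwriting, where one additional block is misclassified after post-processing.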

The  error  rate  reduces  to  about  half  of  the  original  for  both  machine  printed  text 
and  noise,  but  increases  slightly  for  handwriting.  However,  compared  with  Table  3,  the 
precision  of  handwriting  increases  from  62.9%  to  83.3%  due  to  fewer  machine  printed  text 
and  noise  misclassifications  as  handwriting.  The  overall  accuracy  increases  from  96.1% 
to  98.1%. 

Fig.  8  shows  another  example  of  machine  printed  text  and  handwriting  identification 
from  noisy  documents.  To  display  the  classification  results  clearly,  we  decompose  the 
classified  image  into  three  layers,  representing  machine  printed  text  (Fig.  8(b)),  hand¬ 
writing  (Fig.  8(c)),  and  noise  (Fig.  8(d))  respectively.  The  result  is  good  with  very  few 
misclassifications. 

Our  approach  is  very  general,  and  can  be  extended  to  other  languages  with  minor 
modification.  Fig.  9  shows  identification  results  for  a  Chinese  document.  In  Chinese, 
there  is  no  clear  definition  of  words  and  no  spaces  between  neighboring  words.  Therefore, 
the  parameters  of  our  word  segmentation  module  are  adjusted  to  get  characters.  We  only 
need to retrain the classifiers; the post-processing module is unchanged. We can see that most
handwriting and noise blocks are classified correctly, but several machine printed digits
are misclassified as handwriting. On the right margin of the document, some machine
printed text is identified as noise due to touching.

Our approach is fast; the average processing time for a business letter scanned at
300 DPI is about 2-3 seconds on a PC with a 1.7 GHz CPU and 1.0 GB of memory.

Figure 8: An example of machine printed text and handwriting identification from noisy
documents. (a) The original document image, (b) machine printed text, (c) handwriting,
(d) noise. The logo is classified as noise since currently we only consider three classes.

Figure 9: An example of machine printed text and handwriting identification from Chinese
documents. (a) Original Chinese document image, (b) machine printed text, (c) handwriting,
(d) noise.

5.4 Page Segmentation in Noisy Images

In this experiment we show that removing the identified noise improves general page
segmentation results. We evaluated two widely used zone segmentation algorithms: the
Docstrum algorithm [2] and ScanSoft SDK, a commercial OCR software package [3]. Many
different zone segmentation evaluation metrics have been proposed in previous work.
Kanai et al. [52] evaluated zone segmentation accuracy from the OCR perspective: zone
splitting or merging is not penalized if it does not affect the reading order of the
text. The approach of Mao et al. [53] is based on text lines and penalizes only
horizontal text line splitting and merging, since these change the reading order of the
text. Randriamasy et al. [54] proposed an evaluation method based on multiple ground
truths, which is very expensive. Liang’s approach [30] works at the zone level: after
finding the correspondence between the segmented and ground-truth zones, any sufficiently
large difference is penalized. We use Liang’s scheme in our experiment since we focus on
zone segmentation. From the OCR perspective, vertical splitting or merging of different
zones should not be penalized even when these zones have different physical and semantic
properties, but from the point of view of zone segmentation, it should be penalized.
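
To make the zone-level bookkeeping concrete, the following is a minimal sketch of
overlap-based zone matching. It is not Liang’s actual metric [30]: the bounding-box
representation, the 0.9 overlap tolerance, and all function names are assumptions chosen
for illustration. It only shows how zones could be sorted into correct, partially
correct, missed, and false alarm categories once segmented zones are compared with
ground-truth zones.

```python
from typing import Dict, List, Tuple

# Hypothetical zone representation: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


def area(box: Box) -> int:
    """Area of a box; degenerate (empty) boxes have area 0."""
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])


def overlap(a: Box, b: Box) -> int:
    """Area of the intersection of two boxes (0 if they do not intersect)."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(inter)


def score_zones(truth: List[Box], found: List[Box], tol: float = 0.9) -> Dict[str, int]:
    """Count correct, partial, missed, and false-alarm zones (illustrative rule)."""
    counts = {"correct": 0, "partial": 0, "missed": 0, "false_alarm": 0}
    for t in truth:
        hits = [f for f in found if overlap(t, f) > 0]
        if not hits:
            counts["missed"] += 1
        elif (len(hits) == 1
              and overlap(t, hits[0]) >= tol * area(t)
              and overlap(t, hits[0]) >= tol * area(hits[0])):
            # One segmented zone covers the truth zone and little else.
            counts["correct"] += 1
        else:
            # Split across several zones, merged with a neighbor, or badly cropped.
            counts["partial"] += 1
    # A segmented zone that touches no ground-truth zone is a false alarm.
    counts["false_alarm"] = sum(
        1 for f in found if all(overlap(t, f) == 0 for t in truth)
    )
    return counts


truth_zones = [(0, 0, 100, 20), (0, 30, 100, 50), (0, 60, 100, 80), (0, 90, 100, 110)]
found_zones = [(0, 0, 100, 20), (0, 30, 100, 80), (200, 200, 250, 220)]
result = score_zones(truth_zones, found_zones)
print(result)
```

In this toy input the second segmented zone merges two vertically stacked ground-truth
zones, so both are counted as partially correct. This is the kind of vertical merging
that a zone-level metric penalizes and an OCR-oriented metric would not.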

There are 1,374 machine printed text zones in 94 noisy document images. The experimental
results are shown in Table 5. All merging and splitting errors are counted as partially
correct in the table. Before noise removal, ScanSoft gets very poor results on noisy
documents under this metric, with an accuracy of 15.9%. After analyzing the segmentation
results, we found that ScanSoft tends to merge horizontally arrayed zones into one zone,
which is suitable for documents with simple layouts such as technical articles, but not
for other document types such as business letters. The Docstrum algorithm outputs many
more zones than ScanSoft, resulting in a higher accuracy (53.0%), but also a higher false
alarm rate (114.1%). After noise removal, the accuracy of both algorithms increases
significantly, from 15.9% to 48.4% for ScanSoft and from 53.0% to 78.0% for the Docstrum
algorithm. The false alarm rate is reduced from 32.5% to 1.3% for ScanSoft and from
114.1% to 7.9% for Docstrum.
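
The percentages above appear to be expressed relative to the 1,374 ground-truth zones,
which is why a false alarm rate can exceed 100%: before noise removal Docstrum reports
more spurious zones than there are true zones. The short sketch below reproduces the
quoted rates from the zone counts in Table 5; the helper name and the dictionary layout
are illustrative assumptions.

```python
TOTAL_TRUTH_ZONES = 1374  # machine printed text zones in the 94 test images


def percent_of_truth(zone_count: int, total: int = TOTAL_TRUTH_ZONES) -> float:
    """Express a zone count as a percentage of the ground-truth zone total."""
    return round(100.0 * zone_count / total, 1)


# Zone counts before noise removal, taken from Table 5.
before = {
    "ScanSoft": {"correct": 219, "false_alarm": 446},
    "Docstrum": {"correct": 728, "false_alarm": 1568},
}

for name, counts in before.items():
    accuracy = percent_of_truth(counts["correct"])
    false_alarm = percent_of_truth(counts["false_alarm"])
    print(f"{name}: accuracy {accuracy}%, false alarm rate {false_alarm}%")
```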

Fig. 10 shows the zone segmentation results for two noisy documents with the Docstrum
algorithm before and after noise removal. The handwriting is output to another layer
which is not shown here. We can see that after noise removal, there are many fewer
splitting and merging errors, and overall the segmentation results are significantly
improved.

Figure 10: Zone segmentation before and after noise removal using the Docstrum algorithm.
(a) and (c) show the results before noise removal; (b) and (d) show the results after
noise removal.

Table 5: Machine printed zone segmentation experimental results on 94 noisy document
images (1,374 zones in total), before and after noise removal.

                        Correctly        False alarm      Partially correctly   Missed
                        segmented zones  zones            segmented zones       zones
Before noise removal
  ScanSoft              219 (15.9%)      446 (32.5%)      1148 (83.7%)          7 (0.5%)
  Docstrum              728 (53.0%)      1568 (114.1%)    646 (47.0%)           0 (0.0%)
After noise removal
  ScanSoft              665 (48.4%)      18 (1.3%)        671 (48.8%)           38 (2.8%)
  Docstrum              1071 (78.0%)     109 (7.9%)       270 (19.7%)           33 (2.4%)

6 Summary

In this paper, we have presented an approach to segmenting and identifying text from
extremely noisy document images. Instead of using simple filtering rules, we treat noise
as a distinct class, and use statistical classification techniques to classify each block
as machine printed text, handwriting, or noise. We then use Markov Random Fields to
incorporate contextual information for post-processing. Experiments show that MRFs are a
very effective tool for modeling local dependency among neighboring image components.
After post-processing, the classification error rate is reduced by approximately 50%.
Our method is general enough to be extended to documents in other languages. The
technique presented in this paper can be used for image enhancement to improve page
segmentation accuracy of noisy documents. After noise identification and removal, the
zone segmentation accuracy increases from 53% to 78% using the Docstrum algorithm.

Currently our clique potential definition considers only the labels of each block inside
the clique, which may lose useful information. For example, for clique Cp, a clique of
three printed words with roughly the same height is quite different from one with
different heights. In the latter case, it is possible that one of the blocks is
erroneously identified. Another potential improvement is to integrate high-level
contextual information in addition to the local contextual information that we used. For
example, the text line and zone segmentation results can be fed back to our
classification module to refine the classification. Effective use of contextual
information is one of our future research directions.